Clustering analysis, model comparison and visualization using data on Chicago Public Schools

Javier Orman
LinkedIn: https://www.linkedin.com/in/javierorman/
GitHub: https://github.com/javierorman

In this notebook, I analyze data on Chicago Public Schools. I obtained the data by scraping a few websites containing survey and income information. The goal is to be able to answer the following questions:

I gathered the data using web scrapers, available in this project's GitHub repository.
The school_data information comes from https://5-essentials.org/cps/5e/2018/ and it contains 5 different ratings per school (values 0-100): ambitious instruction, collaborative teachers, effective leaders, involved families and supportive environment. The scraper also obtained the name of each school and full address.
The incomes data was scraped from this website: https://www.zipdatamaps.com/zipcodes-chicago-il. It contains the median household income for each zip code.

Imports

Import packages

Import data

Exploratory Data Analysis

School data

There is a mistake in the data: the zip code for George Washington Carver Military Academy HS is supposed to be 60827, not 60627.

The web-crawling 'spider' inserted -1 for missing values. Schools that didn't have numbers for any of the categories were skipped in the process.

If we allow a maximum of 1 missing value per row, we lose 58 instances... almost 10% of our dataset. With max_nans_allowed set at 2, we only lose 15 instances.

From these summary statistics and the following histograms, we see that there are no invalid values such as negatives, numbers over 100 or NaNs.

Income data

Join dataframes

Scale income

All features except for income have similar ranges: 0 to 100 (or 99 if we want to be precise). In order for some of the clustering algorithms to properly detect patterns in the data, we need income to be in a similar range as the other features.
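As a minimal sketch of this rescaling, here is one way to bring income into the 0-100 range using scikit-learn's MinMaxScaler. The income values below are made up for illustration, and the choice of MinMaxScaler is an assumption; the notebook's actual scaling code may differ.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the joined dataframe's income column:
# median household incomes (in dollars) for a few zip codes.
income = np.array([[30000.0], [55000.0], [87000.0], [120000.0]])

# Rescale income to the 0-100 range shared by the five survey ratings.
scaler = MinMaxScaler(feature_range=(0, 100))
income_scaled = scaler.fit_transform(income)

print(income_scaled.ravel())
```
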

Although it's impossible to draw conclusions about the data's cluster structure by looking at only 1 or 2 variables at a time, it's worth noting that no clear separations are visible just yet.

Utility Functions

Model Comparison

Scikit-Learn offers implementations of a variety of clustering algorithms. Each of these has its own hyper-parameters that can be tweaked and its own 'preferences' as to what types of clusters it is able to distinguish. I will provide a short description of each algorithm.
We'll also evaluate how each algorithm performs under different hyper-parameter values. Because we're dealing with unsupervised learning, we do not have a 'ground truth' to which we can compare the cluster assignments. However, we can use the silhouette coefficient, which measures how close each instance is to the other instances of its own cluster, compared to the instances in the next closest cluster.
We can interpret the silhouette coefficient as follows: a value close to +1 means the instance sits well inside its own cluster, a value near 0 means it lies close to a cluster boundary, and a negative value suggests it may have been assigned to the wrong cluster.

Silhouette coefficients can be visualized in a silhouette diagram, where samples are grouped by cluster and their silhouette coefficients plotted horizontally. Ideally, we'd like all clusters to cross the average silhouette coefficient (the silhouette score) and contain as few negative values as possible.
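The quantities behind a silhouette diagram can be sketched as follows, using synthetic blob data as a stand-in for the scaled school features (the variable names and blob parameters are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the scaled school features.
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# One coefficient per instance; their mean is the silhouette score.
coeffs = silhouette_samples(X, labels)
score = silhouette_score(X, labels)

# Per-cluster sorted coefficients are exactly what a silhouette diagram plots.
for k in range(3):
    cluster_coeffs = np.sort(coeffs[labels == k])
    print(k, cluster_coeffs.mean().round(3))
print("overall score:", score.round(3))
```
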

In addition to silhouette diagrams, we can do something much less formal: visualize the cluster assignments on the features themselves. Because we cannot visualize 6 dimensions at a time, we'll just use 3 pairs of features and be content for now. These visualizations should not be used to draw conclusions but merely to partially observe the results of the algorithms.

Once we narrow down our options, we'll compute a 3-D projection of the dataset using t-SNE. Although we will lose all concept of units at that point, we'll be able to visualize the results in a more informative way.

K-Means

Probably the most popular of the bunch, K-Means groups data points using distances. The process amounts to these basic steps:

  1. Initialization: cluster centroids are placed at randomly chosen instances. In sklearn, the default initialization is "k-means++", which tends to place the initial centroids far from each other.
  2. Instances are assigned to their closest centroids
  3. Centroids are updated based on the instances that were assigned to them
    Repeat steps 2 and 3 until the centroids stop moving.

In K-Means, we need to tell the algorithm how many clusters we'd like to see. We can evaluate different numbers k of clusters with silhouette diagrams.
This algorithm expects clusters to be of globular (spherical) shapes and similar sizes. It also assigns every instance to a cluster, including 'noise' points.

We start by exploring the algorithm with k=5 number of clusters:

Compare KMeans for k values from 2 to 10:
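A sketch of that comparison, again on synthetic stand-in data; in the notebook the loop would run on the scaled feature matrix instead of these blobs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

# Fit KMeans for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print({k: round(s, 3) for k, s in scores.items()})
print("best k:", best_k)
```
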

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

From sklearn's documentation: DBSCAN finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.

The algorithm finds core instances by determining whether they have enough neighbors (min_samples) within a certain distance epsilon. All instances within the epsilon-neighborhood of a core instance are assigned to the same cluster. An instance that is not in the neighborhood of any core instance is treated as an anomaly.

This algorithm works with clusters of any size and shape. The challenge when using DBSCAN is tuning the hyperparameters eps and min_samples.

In order to find a good value for epsilon, we'll first get an idea of how far the samples are from each other. We can get the distance between each instance and its nearest neighbor by using the NearestNeighbors class. This is an unsupervised algorithm that computes and returns, for each instance, the distance to its closest neighbor and the index of that neighbor. We'll then sort and plot these distances.
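A minimal sketch of that computation, on a random stand-in matrix with the same 0-100 feature range (the data and variable names are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the scaled feature matrix.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(100, 6))

# n_neighbors=2 because, querying X against itself, each point's
# nearest "neighbor" is the point itself at distance 0.
nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = nn.kneighbors(X)

# Column 1 holds the distance to the true nearest neighbor;
# sorting it produces the curve used to eyeball epsilon.
nearest = np.sort(distances[:, 1])
print(nearest[:5], nearest[-5:])
```
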

We can see that most (almost all) of the instances are at a distance of 20 or less from their nearest neighbor. We can safely say that by setting epsilon somewhere between 16 and 21, DBSCAN will find regions of high-density and discard instances that are far away from any dense areas (anomalies). I will choose 18 for now and tweak later if needed.
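A sketch of running DBSCAN with these hyper-parameter values. The data below is synthetic: one dense region plus a handful of far-away points, built so its distance scale roughly mimics the 0-100 features (eps=18 only makes sense on data with a comparable scale).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in: one dense region plus a few clearly distant points.
rng = np.random.default_rng(0)
dense = rng.normal(50, 6, size=(200, 6))
far = rng.uniform(0, 100, size=(10, 6)) + 200  # well outside the dense region
X = np.vstack([dense, far])

db = DBSCAN(eps=18, min_samples=5).fit(X)
labels = db.labels_

# DBSCAN labels outliers with -1; they do not count as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_outliers = int((labels == -1).sum())
print("clusters:", n_clusters, "outliers:", n_outliers)
```
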

Explore the outliers

By setting min_samples to 5, we obtained one cluster ('0') with 568 instances and 72 outliers, which were assigned the label -1. Although we didn't get multiple clusters, this could still be interesting information. Looking at the table below, we can see that many of these outliers had a large spread of values across the different variables. There could be other reasons why these instances are considered anomalous by the algorithm, but that is subject to further analysis.

Agglomerative Clustering (aka Hierarchical Clustering)

From Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow by Aurélien Géron: A hierarchy of clusters is built from the bottom up. Think of many tiny bubbles floating on water and gradually attaching to each other until there's one big group of bubbles.
A nice aspect of this algorithm is that we can visualize how it works by plotting a dendrogram. As shown below, this lets us visualize the sequential pairing, first of individual instances, then of combinations of instances, until there are only a few clusters and finally just one. We can then decide how many clusters we want to end up with and compare the results with the silhouette diagrams.

Looking at the dendrogram, it seems like n_clusters should be 2, 3 or 4.
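A sketch of building the linkage matrix behind the dendrogram and then clustering, on synthetic stand-in data with an obvious two-group structure (the data and group parameters are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in: two well-separated groups of 6-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(20, 5, (30, 6)), rng.normal(70, 5, (30, 6))])

# Ward linkage matrix; passing it to dendrogram() draws the tree.
Z = linkage(X, method="ward")
# dendrogram(Z)  # would render the plot in the notebook

# Cut the hierarchy at the chosen number of clusters.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
print(np.bincount(labels))
```
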

Mean-Shift

Just as with DBSCAN, here we consider the neighborhood of each instance:

  1. We start with circles of radius bandwidth around all instances.
  2. For each instance, we calculate the mean of all instances that fall within its circle.
  3. Move the circle so that it's centered on that mean (hence the name Mean-Shift)
    Repeat steps 2 and 3 until the circles converge. The bandwidth that we choose initially determines how many clusters we end up with, i.e. at how many distinct locations the circles converge.
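The steps above can be sketched with scikit-learn's MeanShift on synthetic stand-in data; estimate_bandwidth and the quantile value are one plausible way to pick the radius, not necessarily the notebook's choice.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled features.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=0)

# estimate_bandwidth gives a data-driven starting point for the radius.
bw = estimate_bandwidth(X, quantile=0.2, random_state=0)
ms = MeanShift(bandwidth=bw).fit(X)

# The number of converged circle centers is the number of clusters found.
print("bandwidth:", round(bw, 2), "clusters found:", len(ms.cluster_centers_))
```
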

Affinity propagation

The description from the hdbscan library documentation is succinct: Affinity Propagation is a newer clustering algorithm that uses a graph based approach to let points ‘vote’ on their preferred ‘exemplar’. The end result is a set of cluster ‘exemplars’ from which we derive clusters by essentially doing what K-Means does and assigning each point to the cluster of its nearest exemplar.

As with KMeans, this algorithm expects clusters to be globular. A big plus is that it automatically determines the number k of clusters for us. A drawback is that it is very slow.

There are not many hyper-parameters to fine-tune. We'll try different values for preference, which determines how much each instance will "want" to become an exemplar or representative. The higher the value, the more clusters we'll get.
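A sketch of that effect on synthetic stand-in data. Note that the preference values are chosen for these blobs' distance scale; values like -330000 in the notebook are on the scale of the actual dataset's squared distances.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled features.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Lower (more negative) preference -> fewer exemplars -> fewer clusters.
counts = []
for pref in (-200, -50):
    ap = AffinityPropagation(preference=pref, random_state=0).fit(X)
    counts.append(len(ap.cluster_centers_indices_))

print(dict(zip((-200, -50), counts)))
```
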

Gaussian Mixtures

A Gaussian Mixture Model (GMM) starts off with the assumption that the data was produced by a number of multivariate Gaussian (Normal) distributions. It then tries to find the parameters of those distributions, namely the means and covariance matrices. The latter give the distributions their variance along each dimension as well as their orientation, given by the covariances. Another parameter is a weight that is assigned to each distribution.

The algorithm for finding the distributions is called Expectation-Maximization. After telling the model how many clusters we'd like, these are the basic steps that EM follows:

  1. Initialize the cluster parameters randomly
  2. Assign each instance to a distribution, namely the one that gives it the highest probability of being produced
  3. Update each cluster by looking at the samples it was made responsible for
    Repeat steps 2 and 3 until convergence.

In sklearn, covariance_type can be set to 'full' (it is by default), which allows the clusters to take any shape, size and orientation. This makes GM extremely flexible.
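A sketch of fitting a GMM and inspecting the learned parameters, on synthetic stand-in data (the blob data and component count are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled features.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
labels = gm.predict(X)

# Learned parameters: one weight, mean vector and covariance matrix per component.
print("weights:", gm.weights_.round(2))
print("covariances shape:", gm.covariances_.shape)
```
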

Bayesian Gaussian Mixture

In this extension of GMMs, the cluster parameters (means, covariance matrices and weights) are themselves treated as latent random variables which come from certain prior distributions whose parameters the algorithm will try to find.

A drawback of this algorithm is that it prefers ellipsoidal shapes, so it won't work well with other data configurations.

Most promising models

From looking at the silhouette scores and diagrams, these seem to be the best models for the task:

KMeans:
n_clusters = 2
n_clusters = 3

AffinityPropagation:
preference = -330000 ---> number of clusters: 2
preference = -170000 ---> number of clusters: 3

GaussianMixture:
n_components = 2
n_components = 3

Next, we'll compare these models using silhouette scores and diagrams. After that, we'll visualize the effects of the individual models on the data using 3-D projections.

t-SNE (t-Distributed Stochastic Neighbor Embedding) reduces dimensionality to n_components dimensions, usually 2 or 3 when used for visualization. As we reduce the dataset to 3 dimensions in this case (from the original 6), we lose the original units, but still get an idea of which points are close together and which are far apart. We can then color each point by the label that was assigned by the model.
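A sketch of that projection, using a synthetic 6-D stand-in for the scaled school features (the data, init and perplexity choices are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic 6-D stand-in for the scaled school features.
X, _ = make_blobs(n_samples=100, centers=3, n_features=6, random_state=0)

# Project down to 3 dimensions for visualization; units are no longer meaningful.
X_3d = TSNE(n_components=3, init="random", perplexity=30,
            random_state=0).fit_transform(X)

print(X_3d.shape)
```
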

Selected model

The model that seems to perform best on this dataset happens to be the very first one we tried: KMeans with n_clusters=2.
Let's make one last visualization, once again using the t-SNE 3-D projection, but this time using the individual silhouette coefficients to color the points. In order to distinguish between labels, we can make one of the labels negative: for any instance assigned label '0', we multiply their silhouette coefficient by -1.
This way we get a range of [-1, 1] and can use a divergent colormap with samples on the border being lighter.
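That sign-flipping trick can be sketched as follows, again on synthetic stand-in data with a two-cluster structure (the data is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the scaled features with a two-cluster structure.
X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
coeffs = silhouette_samples(X, labels)

# Flip the sign for label 0 so the two clusters occupy opposite halves of a
# divergent colormap: well-assigned points sit near the extremes,
# borderline points near zero.
colors = np.where(labels == 0, -coeffs, coeffs)
print(colors.min().round(3), colors.max().round(3))
```
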

What's in a cluster?

We can visualize how the features vary between the 2 clusters. As we can see below, income doesn't change much from one to the other. However, the other 5 features are quite a bit higher in cluster 0 than in cluster 1, by approximately 30%.

Conclusion

We can now answer the original questions: